HR ANALYTICS EMPLOYEE ATTRITION AND PERFORMANCE

BCon 147: special topics

Author

Agnes S. Campehios

Published

October 24, 2024

1 Project overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

## the datatable function from the DT package creates an HTML widget that displays the dataset

## install the DT package if it is not yet available in your R environment
# data <- readxl::read_excel("dataset/dataset-variable-description.xlsx")
# DT::datatable(data)

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before we start working on the dataset, we need to load the libraries used for data wrangling, analysis, and visualization. Load all of them here. Any package that is not yet installed can be installed with the install.packages function. Some packages are only needed later in this project, so install them as needed and add them to the list below.

library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(tidytext)
library(readr)
library(haven)
library(dplyr)
library(skimr)
library(ggplot2)
library(magrittr)
library(DT)
library(GGally)
library(sjPlot)
library(rlang)
library(DT)
library(htmltools)
library(knitr)
library(rmarkdown)
library(fastmap)
library(ggstatsplot)
library(report)
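
The installation step described above can be sketched as follows. This is a minimal base R sketch, not part of the original solution: missing_packages is a hypothetical helper name, and the package vector is an assumption drawn from the library calls above.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(utils::installed.packages()))
}

# Assumed subset of the packages loaded above; extend as needed.
required_pkgs <- c("tidyverse", "janitor", "DT", "GGally", "sjPlot", "ggstatsplot", "report")
to_install <- missing_packages(required_pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Running the check once at the top of the document avoids scattering install.packages calls through later chunks.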

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more information about the left_join function here.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from DT package.

## import the two data here
employee_dta <- read_csv("dataset/Employee.csv")
perf_rating_dta <- read_csv("dataset/PerformanceRating.csv")


## merge employee_dta and perf_rating_dta using left_join function.
merged_data <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")

## save the merged dataset as hr_perf_dta
hr_perf_dta <- merged_data |> na.omit()


## Use the datatable from DT package to display the merged dataset
datatable(hr_perf_dta)
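
To see what the merge and the na.omit step do to the row count, here is a minimal sketch on toy data. The tibbles toy_employee and toy_rating and their values are hypothetical, not taken from the real files.

```r
library(dplyr)

# Hypothetical toy tables: E1 has two reviews, E2 has one, E3 has none.
toy_employee <- tibble(
  EmployeeID = c("E1", "E2", "E3"),
  Attrition  = c("No", "Yes", "No")
)
toy_rating <- tibble(
  EmployeeID    = c("E1", "E1", "E2"),
  ManagerRating = c(4, 3, 2)
)

# left_join keeps every employee and repeats an employee once per matching review.
toy_merged <- left_join(toy_employee, toy_rating, by = "EmployeeID")
print(toy_merged)

# na.omit then drops employees without any review (E3 here).
toy_complete <- na.omit(toy_merged)
print(nrow(toy_complete))
```

If an employee has several reviews in PerformanceRating.csv, the merged data holds one row per review, so the totals in later tables would count review records rather than unique employees.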

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta
library(janitor)
library(DT)
hr_perf_dta <- hr_perf_dta %>% clean_names()

## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0, and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.

## create cat_education
colnames(hr_perf_dta)
 [1] "employee_id"                        "first_name"                        
 [3] "last_name"                          "gender"                            
 [5] "age"                                "business_travel"                   
 [7] "department"                         "distance_from_home_km"             
 [9] "state"                              "ethnicity"                         
[11] "education"                          "education_field"                   
[13] "job_role"                           "marital_status"                    
[15] "salary"                             "stock_option_level"                
[17] "over_time"                          "hire_date"                         
[19] "attrition"                          "years_at_company"                  
[21] "years_in_most_recent_role"          "years_since_last_promotion"        
[23] "years_with_curr_manager"            "performance_id"                    
[25] "review_date"                        "environment_satisfaction"          
[27] "job_satisfaction"                   "relationship_satisfaction"         
[29] "training_opportunities_within_year" "training_opportunities_taken"      
[31] "work_life_balance"                  "self_rating"                       
[33] "manager_rating"                    
hr_perf_dta <- hr_perf_dta %>%
  mutate(cat_education = case_when(
    education == 1 ~ "No formal education",
    education == 2 ~ "High school",
    education == 3 ~ "Bachelor",
    education == 4 ~ "Masters",
    education == 5 ~ "Doctorate",
    TRUE ~ NA_character_
  ))



## create cat_envi_sat,  cat_job_sat, and cat_relation_sat
hr_perf_dta <- hr_perf_dta %>% mutate(cat_envi_sat = case_when(
    environment_satisfaction == 1 ~ "Very dissatisfied",
    environment_satisfaction == 2 ~ "Dissatisfied",
    environment_satisfaction == 3 ~ "Neutral",
    environment_satisfaction == 4 ~ "Satisfied",
    environment_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_
  )) %>%
  
  # Recode job satisfaction
  mutate(cat_job_sat = case_when(
    job_satisfaction == 1 ~ "Very dissatisfied",
    job_satisfaction == 2 ~ "Dissatisfied",
    job_satisfaction == 3 ~ "Neutral",
    job_satisfaction == 4 ~ "Satisfied",
    job_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_
  )) %>%
  
  # Recode relationship satisfaction
  mutate(cat_relation_sat = case_when(
    relationship_satisfaction == 1 ~ "Very dissatisfied",
    relationship_satisfaction == 2 ~ "Dissatisfied",
    relationship_satisfaction == 3 ~ "Neutral",
    relationship_satisfaction == 4 ~ "Satisfied",
    relationship_satisfaction == 5 ~ "Very satisfied",
    TRUE ~ NA_character_))
  
datatable(hr_perf_dta)
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
hr_perf_dta <- hr_perf_dta %>% mutate(cat_work_life_balance = case_when(
    work_life_balance == 1 ~ "Unacceptable",
    work_life_balance == 2 ~ "Needs improvement",
    work_life_balance == 3 ~ "Meets expectation",
    work_life_balance == 4 ~ "Exceeds expectation",
    work_life_balance == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  )) %>%
  
  # Recode self-rating
  mutate(cat_self_rating = case_when(
    self_rating == 1 ~ "Unacceptable",
    self_rating == 2 ~ "Needs improvement",
    self_rating == 3 ~ "Meets expectation",
    self_rating == 4 ~ "Exceeds expectation",
    self_rating == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  )) %>%
  
  # Recode manager rating
  mutate(cat_manager_rating = case_when(
    manager_rating == 1 ~ "Unacceptable",
    manager_rating == 2 ~ "Needs improvement",
    manager_rating == 3 ~ "Meets expectation",
    manager_rating == 4 ~ "Exceeds expectation",
    manager_rating == 5 ~ "Above and beyond",
    TRUE ~ NA_character_
  ))
  
datatable(hr_perf_dta)
## create bi_attrition
hr_perf_dta <- hr_perf_dta %>%
  mutate(bi_attrition = if_else(attrition == "Yes", 1, 0))
datatable(hr_perf_dta)
## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
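
The case_when recodes above produce character columns, which sort alphabetically in tables and plots. As an alternative sketch (not the approach the task asks for), the same 1 to 5 codes can be mapped to an ordered factor so the labels keep their natural order. The helper name recode_likert and the toy scores are hypothetical.

```r
# Hypothetical helper: map integer codes 1..k to ordered labels.
recode_likert <- function(x, labels) {
  factor(x, levels = seq_along(labels), labels = labels, ordered = TRUE)
}

satisfaction_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                         "Satisfied", "Very satisfied")

# Toy scores, including a missing value, to show how NA is carried through.
toy_scores <- c(3, 1, 5, NA, 2)
cat_toy <- recode_likert(toy_scores, satisfaction_labels)

print(levels(cat_toy))
print(table(cat_toy, useNA = "ifany"))
```

The same helper could be reused for the rating variables by passing the rating labels instead.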

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!

## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta %>%
  select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)



## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- hr_perf_dta %>%
  group_by(job_role) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_job_role
print(attrition_rate_job_role)
# A tibble: 13 × 4
   job_role                  total attrition_count pct_attrition
   <chr>                     <int>           <dbl>         <dbl>
 1 Analytics Manager           208              28        0.135 
 2 Data Scientist             1360             597        0.439 
 3 Engineering Manager         302              18        0.0596
 4 HR Business Partner          23               0        0     
 5 HR Executive                115              29        0.252 
 6 HR Manager                   16               0        0     
 7 Machine Learning Engineer   558              95        0.170 
 8 Manager                     140              19        0.136 
 9 Recruiter                   149              86        0.577 
10 Sales Executive            1521             543        0.357 
11 Sales Representative        489             317        0.648 
12 Senior Software Engineer    494              84        0.170 
13 Software Engineer          1334             445        0.334 
## compute the attrition rate across department and save as attrition_rate_department
attrition_rate_department <- hr_perf_dta %>%
  group_by(department) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_department
print(attrition_rate_department) 
# A tibble: 3 × 4
  department      total attrition_count pct_attrition
  <chr>           <int>           <dbl>         <dbl>
1 Human Resources   303             115         0.380
2 Sales            2149             879         0.409
3 Technology       4257            1267         0.298
## compute the attrition rate across age and save as attrition_rate_age
attrition_rate_age <- hr_perf_dta %>%
  group_by(age) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_age
print(attrition_rate_age) 
# A tibble: 34 × 4
     age total attrition_count pct_attrition
   <dbl> <int>           <dbl>         <dbl>
 1    18    36              36         1    
 2    19    82              72         0.878
 3    20   129             117         0.907
 4    21   223             162         0.726
 5    22   314             218         0.694
 6    23   261              94         0.360
 7    24   466             181         0.388
 8    25   596             205         0.344
 9    26   542             239         0.441
10    27   406             120         0.296
# ℹ 24 more rows
## compute the attrition rate across salary and save as attrition_rate_salary
attrition_rate_salary <- hr_perf_dta %>%
  group_by(salary) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_salary
print(attrition_rate_salary) 
# A tibble: 1,271 × 4
   salary total attrition_count pct_attrition
    <dbl> <int>           <dbl>         <dbl>
 1  20387    10              10             1
 2  20583     1               0             0
 3  20650    10              10             1
 4  20802     1               0             0
 5  21158     1               0             0
 6  21344     1               0             0
 7  21350    10              10             1
 8  21649     2               0             0
 9  21854     9               9             1
10  22089    10              10             1
# ℹ 1,261 more rows
## compute the attrition rate across job_satisfaction and save as attrition_rate_job_satisfaction
attrition_rate_job_satisfaction <- hr_perf_dta %>%
  group_by(job_satisfaction) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_job_satisfaction
print(attrition_rate_job_satisfaction) 
# A tibble: 5 × 4
  job_satisfaction total attrition_count pct_attrition
             <dbl> <int>           <dbl>         <dbl>
1                1   130              36         0.277
2                2  1674             549         0.328
3                3  1651             568         0.344
4                4  1685             573         0.340
5                5  1569             535         0.341
## compute the attrition rate across work_life_balance and save as attrition_rate_work_life_balance
attrition_rate_work_life_balance <- hr_perf_dta %>%
  group_by(work_life_balance) %>%
  summarise(
    total = n(),
    attrition_count = sum(bi_attrition),
    pct_attrition = attrition_count / total
  ) %>%
  ungroup()

## print attrition_rate_work_life_balance
print(attrition_rate_work_life_balance) 
# A tibble: 5 × 4
  work_life_balance total attrition_count pct_attrition
              <dbl> <int>           <dbl>         <dbl>
1                 1   121              37         0.306
2                 2  1702             568         0.334
3                 3  1670             580         0.347
4                 4  1706             560         0.328
5                 5  1510             516         0.342
## Plot the attrition rate of Job Role
ggplot(attrition_rate_job_role, aes(x = reorder(job_role, pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#CDB4DB", color = "#d5aca9") +
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), vjust = -0.5, size = 3.5) +  # Label bars as percentages
  labs(title = "Attrition Rate by Job Role",
       x = "Job Role",
       y = "Attrition Rate (%)") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +  # pct_attrition is a 0-1 proportion, so format the axis as percent
  theme_bw() +  # Using a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "darkblue"),  # Rotate and style x-axis labels
        plot.title = element_text(hjust = 0.5, face = "bold", color = "violet"),
        plot.margin = unit(c(1, 1, 1, 1.5), "cm"))

## Plot the attrition rate of Department
ggplot(attrition_rate_department, aes(x = reorder(department, pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#FFC8DD", color = "#b7094c") +
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), vjust = -0.5, size = 3.5) +  # Label bars as percentages
  labs(title = "Attrition Rate by Department",
       x = "Department",
       y = "Attrition Rate (%)") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +  # pct_attrition is a 0-1 proportion, so format the axis as percent
  theme_bw() +  # Using a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "#5f634f"),  # Rotate and style x-axis labels
        plot.title = element_text(hjust = 0.5, face = "bold", color = "#985277"),
        plot.margin = unit(c(1, 1, 1, 1.5), "cm"))

## Plot the attrition rate of Age
library(dplyr)
library(ggplot2)

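# Simplified computation of attrition rate by age group
# Note: this overwrites the per-age table computed earlier.
# right = FALSE makes each bin include its lower bound, e.g. [18, 30),
# so the youngest employees (ages 18-20) are no longer dropped as NA.
attrition_rate_age <- hr_perf_dta %>%
  mutate(age_group = cut(age, breaks = c(18, 30, 40, 50, Inf), right = FALSE,
                         labels = c("18-29", "30-39", "40-49", "50+"))) %>%
  group_by(age_group) %>%
  summarise(
    total_employees = n(),
    pct_attrition = mean(bi_attrition == 1, na.rm = TRUE) * 100)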

# Plotting attrition rate by age group
ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#f9035e") +
  geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5) +
  labs(title = "Attrition Rate by Age Group", x = "Age Group", y = "Attrition Rate (%)") +
  theme_minimal()
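
Binning with cut depends on which side of each interval is closed, which matters for ages that fall exactly on a break. A minimal sketch with hypothetical toy ages shows the difference between the default right-closed bins and left-closed bins:

```r
# Hypothetical toy ages that sit on or near the bin edges.
toy_age <- c(18, 20, 29, 30)

# Default (right = TRUE): bins are (20, 30], (30, 40], so 18 and 20 fall outside
# every bin and become NA, and 30 lands in the bin labelled "20-29".
default_bins <- cut(toy_age, breaks = c(20, 30, 40), labels = c("20-29", "30-39"))

# right = FALSE: bins are [18, 30), [30, 40), so the labels match the contents.
left_closed_bins <- cut(toy_age, breaks = c(18, 30, 40), right = FALSE,
                        labels = c("18-29", "30-39"))

print(data.frame(toy_age, default_bins, left_closed_bins))
```

Checking for NA values in the binned column is a quick way to confirm that no observations were silently dropped.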

## Plot the attrition rate of Salary

# Simplified computation of attrition rate by salary range
attrition_rate_salary <- hr_perf_dta %>%
  mutate(salary_range = cut(salary, breaks = c(0, 50000, 100000, 150000, Inf),  # Inf keeps salaries above 200k in the "150k+" bin instead of NA
                            labels = c("0-50k", "50k-100k", "100k-150k", "150k+"))) %>%
  group_by(salary_range) %>%
  summarise(
    total_employees = n(),
    pct_attrition = mean(bi_attrition == 1, na.rm = TRUE) * 100)

# Plotting attrition rate by salary range
ggplot(attrition_rate_salary, aes(x = salary_range, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "#ffcaa6", color = "#f86594") +
  geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5) +
  labs(title = "Attrition Rate by Salary Range", x = "Salary Range", y = "Attrition Rate (%)") +
  theme_minimal()

## Plot the attrition rate of Job Satisfaction
ggplot(attrition_rate_job_satisfaction, aes(x = factor(job_satisfaction), y = pct_attrition)) +  # factor() keeps the 1-5 scale in its natural order
  geom_bar(stat = "identity", fill = "#f9c58d", color = "#f492f0") +
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), vjust = -0.5, size = 3.5) +  # Label bars as percentages
  labs(title = "Attrition Rate by Job Satisfaction",
       x = "Job Satisfaction",
       y = "Attrition Rate (%)") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +  # pct_attrition is a 0-1 proportion, so format the axis as percent
  theme_bw() +  # Using a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "black"),  # Rotate and style x-axis labels
        plot.title = element_text(hjust = 0.5, face = "bold", color = "#a18dce"),
        plot.margin = unit(c(1, 1, 1, 1.5), "cm"))

## Plot the attrition rate of Work Life Balance
ggplot(attrition_rate_work_life_balance, aes(x = factor(work_life_balance), y = pct_attrition)) +  # factor() keeps the 1-5 scale in its natural order
  geom_bar(stat = "identity", fill = "#439cfb", color = "#ebf4f5") +
  geom_text(aes(label = scales::percent(pct_attrition, accuracy = 0.1)), vjust = -0.5, size = 3.5) +  # Label bars as percentages
  labs(title = "Attrition Rate by Work Life Balance",
       x = "Work Life Balance",
       y = "Attrition Rate (%)") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +  # pct_attrition is a 0-1 proportion, so format the axis as percent
  theme_bw() +  # Using a theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "black"),  # Rotate and style x-axis labels
        plot.title = element_text(hjust = 0.5, face = "bold", color = "#42047e"),
        plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
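
The six attrition-rate computations in this section repeat the same group_by and summarise steps with a different grouping column. One way to avoid the repetition is a small helper, sketched below. The function name attrition_rate_by and the toy data are hypothetical; the column name bi_attrition follows the dataset above.

```r
library(dplyr)

# Hypothetical helper: attrition rate for any single grouping column.
# Assumes `data` has a 0/1 column named bi_attrition.
attrition_rate_by <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(
      total = n(),
      attrition_count = sum(bi_attrition),
      pct_attrition = attrition_count / total,
      .groups = "drop"  # return an ungrouped tibble
    )
}

# Toy data standing in for hr_perf_dta.
toy_hr <- tibble(
  department   = c("Sales", "Sales", "Sales", "Technology", "Technology"),
  bi_attrition = c(1, 0, 1, 0, 0)
)

toy_rates <- attrition_rate_by(toy_hr, department)
print(toy_rates)
```

With the real data, a call such as attrition_rate_by(hr_perf_dta, job_role) would be expected to reproduce the corresponding table above.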

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.

## conduct correlation of key variables.
hr_corr <- 
hr_perf_dta |> select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance) |> 
na.omit() |> 
cor()


## print hr_corr 

print(hr_corr)
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
## install GGally package and use ggcorr function to visualize the correlation
if (!require(GGally))
  install.packages("GGally");library(GGally)

library(GGally)
library(ggplot2)
library(scales)
library(reshape2)

# hr_corr is already a correlation matrix, so reshape it directly instead of calling cor() on it again
melted_corr <- melt(hr_corr)
ggplot(data = melted_corr, aes(Var1, Var2, fill = value)) + 
  geom_tile() +
  scale_fill_gradient2(low = "#f72585", high = "#4361ee", mid = "white",  # contrasting hues so negative and positive correlations are distinguishable
                       midpoint = 0, limit = c(-1,1), name="Correlation") +
  geom_text(aes(Var1, Var2, label = round(value, 2)), color = "black") + 
  ggtitle("Correlation Heatmap of Key Variables") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", color = "#b7094c"),
        axis.text.y = element_text(hjust = 1, vjust = 0.5),  
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

Discussion:

  • Years at the company has by far the strongest correlation with attrition (r ≈ -0.69). The longer someone has been with the company, the less likely they are to have left.

  • Salary is negatively correlated with attrition (r ≈ -0.21): higher-paid employees are somewhat less likely to leave. Salary is also positively correlated with tenure (r ≈ 0.22), so part of this association may reflect longer-serving employees earning more.

  • Job satisfaction, manager rating, and work-life balance show essentially no linear relationship with attrition; all three correlations are within ±0.02 of zero.

The findings suggest that tenure plays the largest role in retention, while lower salary is a weaker but noticeable correlate of leaving. Correlation does not establish causation, so whether pay itself drives departures needs to be explored further.
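
Because bi_attrition is a 0/1 variable, its Pearson correlation with a numeric variable is a point-biserial correlation, and cor.test can attach a confidence interval and p-value to it. The sketch below uses hypothetical toy values, not the real data, to show the call.

```r
# Hypothetical toy data: tenure in years and whether the employee left (1) or stayed (0).
toy_tenure <- c(1, 1, 2, 2, 3, 5, 6, 8, 9, 10)
toy_left   <- c(1, 1, 1, 1, 0, 1, 0, 0, 0, 0)

# Plain Pearson correlation, as used for hr_corr above.
toy_r <- cor(toy_tenure, toy_left)

# cor.test returns the same estimate plus a 95% confidence interval and p-value.
toy_cor_test <- cor.test(toy_tenure, toy_left)

print(toy_r)
print(toy_cor_test$conf.int)
print(toy_cor_test$p.value)
```

Applied to hr_perf_dta, the same call with years_at_company and bi_attrition would show how precisely the correlation is estimated.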

5.3 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the documentation here on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the documentation here on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.

# hr_perf_dta already had rows with missing values removed by na.omit() in Task 4.1, so no further cleaning is needed here

## run a logistic regression model to predict employee attrition
hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance, 
                              data = hr_perf_dta, 
                              family = binomial)



## print the summary of the model using the summary function
summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance, family = binomial, data = hr_perf_dta)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.571e+00  2.173e-01  11.831   <2e-16 ***
salary            -3.633e-06  4.086e-07  -8.893   <2e-16 ***
years_at_company  -6.333e-01  1.476e-02 -42.919   <2e-16 ***
job_satisfaction   3.470e-02  3.186e-02   1.089    0.276    
manager_rating     5.071e-03  3.810e-02   0.133    0.894    
work_life_balance  2.587e-02  3.198e-02   0.809    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4781.6  on 6703  degrees of freedom
AIC: 4793.6

Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
if (!require(sjPlot))
  install.packages("sjPlot");library(sjPlot)
library(sjPlot)

hr_attrition_glm_model <- glm(bi_attrition ~ salary + years_at_company + job_satisfaction + manager_rating + work_life_balance, 
                              data = hr_perf_dta, 
                              family = binomial)

tab_model(hr_attrition_glm_model, show.ci = TRUE, show.se = TRUE,  transform = NULL)
                     bi attrition
Predictors           Log-Odds   std. Error   CI            p
(Intercept)           2.57      0.22         -Inf – Inf    <0.001
salary               -0.00      0.00         -Inf – Inf    <0.001
years at company     -0.63      0.01         -Inf – Inf    <0.001
job satisfaction      0.03      0.03         -Inf – Inf    0.276
manager rating        0.01      0.04         -Inf – Inf    0.894
work life balance     0.03      0.03         -Inf – Inf    0.419
Observations         6709
R2 Tjur              0.502
## use plot_model function to visualize the model coefficients
plot_model(hr_attrition_glm_model, type = "est", show.values = TRUE, show.p = TRUE, title = "Model Coefficients for Employee Attrition", vline.color = "red", value.offset = .3)

Discussion:

The logistic regression model confirms the pattern seen in the correlation analysis. Tenure is the strongest predictor: each additional year at the company lowers the log-odds of leaving by about 0.63 (p < 0.001), so the longer someone works at the company, the less likely they are to quit. Salary also has a statistically significant negative effect (p < 0.001). Its coefficient appears as -0.00 in the tab_model output only because it is expressed per single unit of salary; at about -3.6e-06 per unit, higher pay is associated with lower odds of leaving even after controlling for tenure. Job satisfaction, manager rating, and work-life balance have small positive coefficients, but none is statistically significant (p = 0.276, 0.894, and 0.419), so the model gives no evidence that they affect attrition. The model fits well (Tjur's R2 = 0.502), driven mostly by tenure. These results suggest that early-tenure and lower-paid employees are the most at risk of leaving, while the satisfaction and rating measures add little. A deeper look at the data, possibly with additional variables, would help clarify what truly drives employee attrition in this specific context.
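
Log-odds coefficients are easier to read as odds ratios. The sketch below exponentiates the two significant coefficients, copied from the summary() output above. Rescaling salary to steps of 10,000 is an illustrative choice, not something the original analysis does.

```r
# Coefficients copied from the summary() output above (log-odds scale).
coefs <- c(salary = -3.633e-06, years_at_company = -6.333e-01)

# Odds ratio for one additional year at the company.
odds_ratio_year <- exp(coefs[["years_at_company"]])

# Odds ratio for a 10,000 increase in salary (assumed step size for readability).
odds_ratio_salary_10k <- exp(coefs[["salary"]] * 10000)

print(round(c(per_year = odds_ratio_year, per_10k_salary = odds_ratio_salary_10k), 3))
```

The results are roughly 0.53 and 0.96: each extra year multiplies the odds of leaving by about 0.53, and each additional 10,000 in salary multiplies them by about 0.96. On the fitted model itself, exp(coef(hr_attrition_glm_model)) gives the per-unit odds ratios for all predictors.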

5.4 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)


## print the results of the t-test
print(attrition_ttest_results)

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 19.074, df = 5557.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 39387.67 48411.52
sample estimates:
mean in group 0 mean in group 1 
      125856.35        81956.76 
## install the report package and use the report function to generate a report of the t-test results

attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)

report(attrition_ttest_results)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.26e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43899.59, 95% CI [39387.67, 48411.52], t(5557.53) = 19.07, p < .001; Cohen's d
= 0.51, 95% CI [0.46, 0.57])
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed


ggbetweenstats(data = hr_perf_dta, x = bi_attrition, y = salary, title = "Distribution of Salary for Employees Who Left vs. Stayed", xlab = "Attrition (Left vs. Stayed)", ylab = "Salary", type = "parametric")

# create histogram and frequency polygon of salary for employees who left and those who stayed
ggplot(hr_perf_dta, aes(x = salary)) +
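  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, fill = "#ff1b6b", alpha = 0.4) +  # Adjust binwidth as needed; after_stat() replaces the deprecated ..density..
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 5000, color = "#45caff", linewidth = 1) +  # linewidth replaces the deprecated size argument for lines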
  facet_wrap(~ bi_attrition) +  labs(title = "Salary Distribution for Employees Who Left vs. Stayed",
    x = "Salary",
    y = "Density") +
  theme_minimal()

Discussion:

Both groups (stayed and left) show a right-skewed salary distribution: most employees sit in the lower salary range, with a sharp peak there and a thin tail of much higher earners. The overall shapes of the two panels (stayed on the left, left on the right) look alike, but the t-test shows that the averages differ. Employees who stayed earned 125,856.35 on average, compared with 81,956.76 for those who left, a difference of 43,899.59 (95% CI [39,387.67, 48,411.52], p < .001) with a medium effect size (Cohen's d = 0.51). Salary is therefore associated with attrition, even though the histograms overlap heavily and salary alone does not cleanly separate the two groups. Revising compensation policies, particularly for lower-paid employees, could be an effective part of a retention strategy, although this analysis shows an association rather than a causal effect.
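As a quick arithmetic check on the t-test output above, the reported difference can be recomputed from the two group means. The sketch below uses base R only. The variable names are illustrative, and the percentage gap is a derived figure that does not appear in the original output.

```r
# Group means copied from the Welch t-test output above
mean_stayed <- 125856.35  # group 0 (stayed)
mean_left   <- 81956.76   # group 1 (left)

# Absolute gap between the groups; this should match the
# "difference" value shown by report()
diff_means <- mean_stayed - mean_left

# The same gap expressed as a share of the stayers' mean salary
pct_gap <- 100 * diff_means / mean_stayed

print(round(diff_means, 2))
print(round(pct_gap, 1))
```

On these figures, employees who left earned roughly 35% less on average than those who stayed.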

5.5 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and summarize functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.

# Remove rows with NA values in Manager or Self Ratings, and apply the mapping
rating_manager_self_filtered <- hr_perf_dta %>%
  filter(!is.na(cat_manager_rating) & !is.na(cat_self_rating)) %>%  # Remove NA rows
  mutate(
    # Map the ordered rating labels to numeric scores 1 to 4
    cat_manager_rating = as.numeric(factor(cat_manager_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond"))),
    cat_self_rating = as.numeric(factor(cat_self_rating, levels = c("Needs improvement", "Meets expectation", "Exceeds expectation", "Above and beyond")))
  )

# Calculate average performance ratings
average_ratings_filtered <- rating_manager_self_filtered %>%
  group_by(bi_attrition) %>%
  summarize(
    Average_ManagerRating = mean(cat_manager_rating, na.rm = TRUE),  # Calculate average Manager Rating
    Average_SelfRating = mean(cat_self_rating, na.rm = TRUE)         # Calculate average Self Rating
  )

# View the average performance ratings
print(average_ratings_filtered)
# A tibble: 2 × 3
  bi_attrition Average_ManagerRating Average_SelfRating
         <dbl>                 <dbl>              <dbl>
1            0                  2.48               2.98
2            1                  2.46               2.99
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
library(ggplot2)

# Create the bar chart with custom legend labels
ggplot(rating_manager_self_filtered, aes(x = factor(cat_self_rating, 
                                      levels = c(1, 2, 3, 4),  # Ensuring the correct order
                                      labels = c("Needs improvement", "Meets expectation", 
                                                 "Exceeds expectation", "Above and beyond")), 
                                      fill = factor(bi_attrition))) +
  geom_bar(position = "dodge", alpha = 0.7, color = "black") +  # Add black outlines to the bars
  scale_fill_manual(values = c("#FF0080", "#BC8F8F"), 
                    labels = c("Stayed", "Left")) +  # Custom colors and labels
  labs(title = "Distribution of Self Rating for Employees Who Stayed vs. Left", 
       x = "Self Rating", 
       y = "Count", 
       fill = "Attrition Status") +
  theme_minimal()

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.

ggplot(rating_manager_self_filtered, aes(x = factor(cat_manager_rating, 
                                      levels = c(1, 2, 3, 4),  
                                      labels = c("Needs improvement", "Meets expectation", 
                                                 "Exceeds expectation", "Above and beyond")), 
                                      fill = factor(bi_attrition))) +
  geom_bar(position = "dodge", alpha = 0.7, color = "black") +  # Add black outlines to the bars
  scale_fill_manual(values = c("#E34234", "#FF9966"), 
                    labels = c("Stayed", "Left")) +  # Custom colors and labels
  labs(title = "Distribution of Manager Rating for Employees Who Stayed vs. Left", 
       x = "Manager Rating", 
       y = "Count", 
       fill = "Attrition Status") +
  theme_minimal()

# Create a boxplot of salary by job satisfaction and attrition status to analyze the relationship between salary, job satisfaction, and attrition
ggplot(hr_perf_dta, aes(x = factor(cat_job_sat, 
                                    levels = c("Very dissatisfied", "Dissatisfied", 
                                               "Neutral", "Satisfied", "Very satisfied")), 
                         y = salary, 
                         fill = factor(bi_attrition))) +
  geom_boxplot(alpha = 0.7) +  # Add boxplots with some transparency
  scale_fill_manual(values = c("#ff8989", "#a9ff68"), 
                    labels = c("Stayed", "Left")) +  # Custom colors and labels
  labs(title = "Boxplot of Salary by Job Satisfaction and Attrition Status", 
       x = "Job Satisfaction", 
       y = "Salary", 
       fill = "Attrition Status") +
  theme_minimal()

Discussion:

Average performance ratings are nearly identical across the two groups: manager ratings average 2.48 for employees who stayed and 2.46 for those who left, while self ratings average 2.98 and 2.99. Performance ratings therefore do not appear to distinguish employees who leave from those who stay.

The boxplot analysis shows that employees with higher salaries are more likely to stay, regardless of their job satisfaction level. In all satisfaction categories, those who left had lower median salaries than those who stayed. Attrition appears particularly common among employees with lower job satisfaction and lower salaries, suggesting that dissatisfaction, when combined with low compensation, increases the likelihood of leaving.

HR interventions should focus on re-evaluating compensation for lower-paid, dissatisfied employees, offering salary adjustments or retention bonuses to prevent attrition. Improving job satisfaction through engagement programs, career development, and work-life balance initiatives could also help retain employees. Regular pulse surveys and exit interviews would provide further insight into dissatisfaction and attrition risks. Lastly, pay equity reviews should be conducted to ensure fair compensation across the organization.
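The claim about median salaries can be checked numerically rather than read off the boxplot. The following is a minimal sketch in base R on a small made-up data frame (`toy`) that only borrows the column names `cat_job_sat`, `bi_attrition`, and `salary` from the analysis. The values are invented for illustration and are not taken from the dataset.

```r
# Toy data standing in for hr_perf_dta (values are made up for illustration)
toy <- data.frame(
  cat_job_sat  = rep(c("Dissatisfied", "Satisfied"), each = 6),
  bi_attrition = rep(c(0, 0, 0, 1, 1, 1), times = 2),
  salary       = c(90000, 110000, 130000, 50000, 60000, 70000,
                   95000, 120000, 140000, 55000, 65000, 80000)
)

# Median salary for every job-satisfaction x attrition cell
median_salary <- aggregate(salary ~ cat_job_sat + bi_attrition,
                           data = toy, FUN = median)
print(median_salary)

# Put stayers (0) and leavers (1) side by side, then take the gap
wide <- reshape(median_salary, idvar = "cat_job_sat",
                timevar = "bi_attrition", direction = "wide")
wide$gap <- wide$salary.0 - wide$salary.1
print(wide)
```

A positive `gap` in every row would correspond to leavers having the lower median salary in every satisfaction category. Running the same two steps on `hr_perf_dta` would show whether that holds in the real data.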

5.6 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

  • Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

  • Use visualizations to show the differences.

  • Assess whether employees with poor work-life balance are more likely to leave.

You are free to choose how to accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.

# Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed
# Load necessary library
library(ggplot2)

# Plot the distribution of WorkLifeBalance ratings for employees who left vs. those who stayed
ggplot(hr_perf_dta, aes(x = factor(work_life_balance), fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  scale_fill_discrete(labels = c("Stayed", "Left")) +  # 0 = stayed, 1 = left
  labs(title = "Distribution of Work-Life Balance Ratings by Attrition Status", 
       x = "Work-Life Balance Rating", 
       y = "Count", 
       fill = "Attrition Status") +
  theme_minimal()

Discussion:

The bar chart shows that employees with the poorest work-life balance (rating of 1) have a higher proportion of attrition than those with higher ratings. For the lowest rating, the number of employees who left (teal) is almost equal to the number who stayed (red). For higher ratings (2 to 5), by contrast, the count of employees who stayed clearly exceeds the count who left. This suggests that employees with poor work-life balance are more likely to leave the organization than those with better work-life balance, so addressing work-life balance could be a key strategy for reducing attrition.
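Because the bar chart shows counts, comparing attrition across ratings is easier with the share of leavers within each rating. The sketch below computes that attrition rate in base R on a small made-up data frame (`toy`). Only the column names `work_life_balance` and `bi_attrition` come from the analysis, and the values are invented for illustration.

```r
# Toy data standing in for hr_perf_dta (made-up values, for illustration only)
toy <- data.frame(
  work_life_balance = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  bi_attrition      = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

# Cross-tabulate rating by attrition status
counts <- table(rating = toy$work_life_balance, attrition = toy$bi_attrition)

# Row-wise proportions: share who stayed (0) and left (1) within each rating
rates <- prop.table(counts, margin = 1)

# Attrition rate per rating, i.e. the share of leavers
attrition_rate <- rates[, "1"]
print(round(attrition_rate, 2))
```

On the real data, the same table could be passed to `chisq.test()` to test formally whether attrition and work-life balance rating are associated.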


6 Task 5.7. Recommendations for HR interventions

Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussion.

  • What are the key factors contributing to employee attrition in the company?

The key factors contributing to employee attrition in the company include age, department, job role, job satisfaction, salary, and work-life balance. Younger employees might leave more often for career changes or further education, while certain departments and job roles with high stress and low satisfaction see higher turnover. Lower job satisfaction and non-competitive salaries are strong indicators of attrition, prompting employees to seek better conditions. Lastly, poor work-life balance significantly drives employees to leave for roles offering better personal life integration. Addressing these factors by improving job satisfaction, offering competitive salaries, fostering a supportive work environment, and enhancing work-life balance can help reduce attrition rates and retain valuable employees.

  • Which factors are most strongly correlated with attrition?

The factors most strongly correlated with attrition are years at the company and salary. Years at the company shows a significant negative correlation with attrition, indicating that employees who have been with the company longer are less likely to leave. Salary shows a moderate negative correlation, suggesting that higher salaries are associated with a lower likelihood of attrition. Employees with longer tenure and higher salaries therefore tend to remain with the company, which points to the importance of rewarding employee loyalty and offering competitive compensation to reduce attrition.
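One way to rank factors by their correlation with a 0/1 attrition flag is the point-biserial correlation, which is simply the Pearson correlation computed with a binary variable. The sketch below is illustrative only: the data frame `toy` is made up, and the column name `years_at_company` is an assumption that may not match the name used in `hr_perf_dta`.

```r
# Toy data standing in for hr_perf_dta (made-up values, for illustration only)
# NOTE: years_at_company is a hypothetical column name
toy <- data.frame(
  bi_attrition     = c(1, 1, 1, 0, 0, 0, 0, 0),
  years_at_company = c(1, 2, 1, 5, 8, 6, 10, 7),
  salary           = c(40000, 55000, 60000, 90000, 120000, 70000, 150000, 95000)
)

# Pearson correlation with a 0/1 variable is the point-biserial correlation
predictors <- c("years_at_company", "salary")
cors <- sapply(predictors, function(v) cor(toy[[v]], toy$bi_attrition))

# Rank by absolute strength, strongest first
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 2))
```

Negative values mean that higher tenure or salary goes with a lower chance of having left, which is the direction described above.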

  • What strategies could be implemented to improve employee retention and satisfaction?

To improve employee retention and satisfaction, companies should focus on competitive compensation to reduce financial dissatisfaction and provide clear career development opportunities to help employees progress within the organization. Emphasizing work-life balance through flexible working hours, remote work options, and wellness programs can help employees manage personal and professional demands. Establishing recognition and reward systems fosters a culture of appreciation, while training managers to build strong, supportive relationships promotes a positive work environment. Finally, regularly conducting job satisfaction surveys and implementing changes based on feedback can proactively address issues, creating a more supportive and engaging workplace. These strategies collectively contribute to higher retention rates and overall employee satisfaction.

  • How can HR leverage the insights from the analysis to develop effective retention strategies?

HR can use these insights to design and implement targeted retention strategies. By understanding the key factors contributing to attrition, such as poor work-life balance, low job satisfaction, and inadequate compensation, HR can focus on improving these areas.

For instance, they can introduce flexible working arrangements to enhance work-life balance, offer competitive salaries to match market standards, and provide clear career development paths to boost job satisfaction. Additionally, investing in managerial training to build supportive relationships and establishing employee recognition programs can further enhance workplace morale and retention. By addressing these specific pain points identified in the analysis, HR can create a more engaging and rewarding environment, thereby reducing attrition rates.

  • What are the potential benefits of implementing these strategies for the company?

Implementing these strategies can significantly enhance employee retention, leading to a more stable and experienced workforce. This stability can improve productivity and efficiency, as employees with longer tenure are generally more skilled and knowledgeable about the company’s processes. Additionally, a happier and more satisfied workforce tends to be more engaged and motivated, which can boost overall morale and foster a positive company culture. Competitive compensation and clear career development paths can also make the company more attractive to top talent, enhancing recruitment efforts. Ultimately, these benefits contribute to better organizational performance and a stronger bottom line.